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ABSTRACT 

Structural variation is an important class of genetic 
variation in mammals. High-throughput sequencing 
(HTS) technologies promise to revolutionize copy- 
number variation (CNV) detection but present sub- 
stantial analytic challenges. Converging evidence 
suggests that multiple types of CNV-informative 
data (e.g. read-depth, read-pair, split-read) need be 
considered, and that sophisticated methods are 
needed for more accurate CNV detection. We 
observed that various sources of experimental 
biases in HTS confound read-depth estimation, and 
note that bias correction has not been adequately 
addressed by existing methods. We present a novel 
read-depth-based method, GENSENG, which uses a 
hidden Markov model and negative binomial regres- 
sion framework to identify regions of discrete copy- 
number changes while simultaneously accounting 
for the effects of multiple confounders. Based on ex- 
tensive calibration using multiple HTS data sets, we 
conclude that our method outperforms existing 
read-depth-based CNV detection algorithms. The 
concept of simultaneous bias correction and CNV 
detection can serve as a basis for combining read- 
depth with other types of information such as read- 
pair or split-read in a single analysis. A user-friendly 
and computationally efficient implementation of our 
method is freely available. 

INTRODUCTION 

Structural variation (SV), including copy-number vari- 
ation (CNV), is a major form of genetic variations in 



mammals (1^1). CNVs have been shown to affect gene 
expressions in human cell lines (5) as well as in different 
tissues of rodents (6-8), and to play an important role 
in the etiology of schizophrenia (9,10), autism (11,12) 
and non-psychiatric diseases (13-15). In functional 
genomic studies, failing to account for copy-number 
differences can lead to errors in ribonucleic acid 
sequencing, chromatin immunoprecipitation sequencing, 
DNase hypersensitive site mapping sequencing and 
formaldehyde-assisted isolation of regulatory elements 
sequencing (16,17). Thus, accurate detection for CNVs 
in a single genome is important. 

Microarray technologies were the main platform for 
initial work in CNV characterization (18,19) and remain 
a cost-efficient choice (20). High-throughput sequencing 
(HTS) technologies (21-23) promise a complete catalog 
of SV and could replace microarrays as a discovery 
platform (20,24). While microarray-based CNV detection 
analyzes probe hybridization intensities, HTS-based CNV 
detection uses conceptually distinctive approaches: read- 
pair, split-read and read-depth (RD) analyses, which vary 
in their sensitivity and specificity depending on the sizes 
and classes of SVs (1,20,24). Converging evidence suggests 
that multiple approaches should be considered together to 
maximize CNV detection from HTS data. For example, 
the 1000 Genomes Project (1000GP) used 19 algorithms to 
independently identify CNVs in 185 human genomes and 
pooled the results according to the specificity of each al- 
gorithm (1). Recent methods [SPANNER (1), CNVer (24) 
and Genome STRiP (25)] integrate read-pair and RD in 
the detection process in different ways. 

The RD approach looks for higher or lower than 
expected sequencing coverage in a genomic region to 
infer gain or loss of DNA. RD has been computed in a 
variety of ways, including counting the number of frag- 
ments (25) or reads (26-29) mapped to a particular 
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genomic region and calculating the sum of per-base 
coverage within a region (30,31). Existing CNV detection 
methods assume that RD follows a Poisson distribution 
(or a normal distribution as the large-sample approxima- 
tion of the Poisson model) for a diploid genome and 
search for regions that diverge from this distribution. 
However, in practice, neither sampling nor mapping of 
the reads is uniform because of experimental biases. GC 
content can lead to certain genomic regions being over- or 
under-sampled (22). Repetitive DNA elements are 
abundant in the mammalian genomes (32); consequently, 
the number of reads unambiguously mapped to a region 
could be very different from the number of reads 
sequenced from the region. Additional sources of bias, 
which are more difficult to trace (e.g. noise arising from 
sequencing, sequencing errors), create further variability 
in RD coverage. Violation of the assumed Poisson distri- 
bution entails loss of sensitivity /specificity to detect CNVs 
using RD. In studies of cancer, matched pairs of tumor- 
and normal-tissue samples may be used to correct biases 
by computing RD ratios (33-36); however, matched 
control samples are generally not available. 

Bias correction has not been adequately addressed in 
the literature. Some existing methods (26,28,31) adopt a 
two-step approach where RD is first smoothed for GC 
content differences using linear regression, and the 
GC-adjusted RD is then segmented. Other methods 
(25,29) account for mapping bias of a candidate region 
using its effective length (e.g. the number of confidently 
mapped bases); however, this approach does not account 
for the dependence between consecutive regions or add- 
itional sources of noise in the data. While various kinds of 
adjusted RD have been used as input data, nearly all 
methods (25,28-30) use the Poisson or normal- 
distribution assumption without subsequent evaluation 
the adequacy of the distributional assumption. 

In this study, we first show that evaluation of the distri- 
bution assumption for RD is important, as it may not hold 
true. Second, we show that it is important to jointly 
estimate copy number and the effect of confounding 
factors. Third, we develop a novel statistical method to 
accurately model RD and detect CNVs from HTS data. 
We measure RD by the number of sequence fragments 
mapped in sliding windows tiled along the genome, and 
we model the fragment counts by negative binomial (NB) 
distributions, which allow for over-dispersion and account 
for the effects of confounders. Furthermore, we account for 
the dependence of fragment counts of adjacent windows 
using a hidden Markov model (HMM). Known confound- 
ing factors are treated as covariates and corrected explicitly, 
while unknown experimental biases are accommodated by 
the over-dispersion parameter of the NB distribution and 
by an additional noise component via a mixture model. 
Fourth, we calibrate our method using simulation and 
whole-genome sequencing data from the 1000GP, and we 
compare our method with CNVnator (26), the 
best-performing RD-based CNV detection algorithm in 
the literature (1). Finally, to demonstrate the utility and 
robustness of our method, we apply our method to both 
human and mouse HTS data sets. 



In summary, our method outperforms existing 
RD-based CNV detection algorithms and distinguishes 
homozygous and heterozygous deletions and high-copy 
duplications. Our method complements the current litera- 
ture, and the concept of simultaneous bias correction and 
CNV detection can serve as a basis for combining RD 
with read-pair or split-read in a single analysis. A 
user-friendly and computationally efficient implementa- 
tion of our complete analytic protocol is freely available 
at https://sourceforge.net/projects/genseng/. 

MATERIALS AND METHODS 

Data sets included in this study 
1000GP data 

For method development and assessment, we used the 
whole-genome sequencing data from three HapMap indi- 
viduals sequenced as part of the 1000GP. These include 
the CEU parent-offspring trio of European ancestry 
(NA12878, NA12891, NA12892), sequenced to 42x 
coverage on average using the Illumina Genome 
Analyzer (I and II) platform. Sequencing reads were a 
mixture of single end and paired end with variable 
lengths (36 bp, 51 bp). The complete genome sequence 
data were obtained in the form of '.bam' alignment files 
from ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/pilot_ 
data/data/. Reads were aligned to the human reference 
genome NCBI37 using BWA (37) (v.0.5.5) as described 
in the online documentation: ftp://ftp.ncbi.nlm.nih.gov/ 
1000genomes/ftp/README.alignment_data. 

High-confidence CNVs 

To assess the sensitivity of our discovery method, we used 
the high-confidence CNVs established for NA12878 
[Supplementary Table S6 of Mills et al. (1)] by combining 
CNVs reported in earlier surveys that used high-density 
microarrays (2,38,39), fosmid sequencing (40) or ABI 
tracing mapping (41). This data set included 610 deletions 
[~82% from microarray reports (2,38,39)] and 261 dupli- 
cations [100% from microarray reports (2,38,39)] from the 
autosomes of NA12878. The second high-confidence data 
set used in this study was generated by Handsaker et al. 
(25) by accurately genotyping deletions from the 1000GP 
HTS data (1). The complete data set was downloaded 
from ftp://ftp.broadinstitute.Org/pub/svtoolkit/misc/l kg/ 
NGPaper/ and included deletions for the three aforemen- 
tioned HapMap individuals. This data set included 2301 
deletions for NA12878, 2200 deletions for NA12891 and 
2055 deletions for NA12892. CNV coordinates reported in 
both high-confidence data sets were translated from 
NCBI36 to NCBI37 using liftOver. 

Data from other sequencing projects 

To demonstrate the robustness of our method, we used 
HTS data from two different studies using various sequ- 
encing platforms, sequencing depths and read lengths. In 
the first study, we used the whole-genome sequencing data 
of three individuals affected with bipolar disorder from a 
large multiplex Spanish pedigree currently under investi- 
gation. Paired-end sequencing with 100-bp reads was per- 
formed at the University of North Carolina on the 
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Illumina HiSeq 2000 platform. Each individual was 
sequenced to an average of 15x coverage. Reads were 
aligned to the human reference genome NCBI37 using 
BWA (37) (v.0.5.5) with default parameters. In the 
second study, we downloaded (ftp://ftp-mouse.sanger.ac. 
uk/curren t_bams/) the whole-genome sequencing data of 
inbred mouse strains made freely available by the Mouse 
Genomes Project conducted at the Sanger Institute (42). 
All mouse samples were sequenced on the Illumina GAII 
platform with a mixture of 54-, 76- and 108-bp paired 
reads to a coverage ranging from 17x to 43 x. Reads 
were aligned to the mouse reference genome NCBI37 
using the MAQ aligner (3,43). For this study, we 
analyzed the alignment files for 13 inbred strains 
(129SlSvlmJ, A/J, AKR/J, BALB/cJ, C3H/HeJ, CAST/ 
EiJ, CBA/J, DBA/2J, LP/J, NOD/LtJ, NZO/HILtJ, 
PWK/PhJ, WSB/EiJ). We also downloaded the released 
SV calls for these strains from ftp://ftp-mouse.sanger.ac. 
uk/current_svs/. These SVs have been classified into 
several categories based on specific paired-end mapping 
patterns briefly described in Yalcin et al. (3). From this 
SV release, we extracted 2 categories, including deletions 
and copy-number gains (GAINS and TANDEMDUP), to 
compare with the GENSENG-predicted calls. 

Reference genomes 

The human reference genome NCBI37 was obtained from 
f tp ://ftp . ncbi .nlm . nih . go v/ 1 OOOgenomes /ftp /technical /ref 
erence/human_glk_v37.fasta.gz. The mouse reference 
genome NCBI37 was obtained from ftp://ftp-mouse. 
sanger.ac.uk/ref/NCBIM37_um.fa. 

Input data preparation for CNV detection 

Our input data are a triplet of RD signal, GC content and 
mappability score computed in sliding windows tiled 
along the genome. 

RD signal 

Alignment (.bam) files are parsed out using SAMtools (44) 
with a quality control (QC) filter to extract confidently 
aligned reads (see Additional file 1, Supplementary 
Methods). A (single-end or paired-end) sequence read 
represents one or two ends of a DNA fragment 
randomly sampled from the donor/sample genome. 
Using reads passing the QC filter, the RD signal is 
calculated as the number of sequenced DNA fragments 
in sliding windows, ensuring each fragment is counted 
only once (Supplementary Methods). 

GC content 

First, we calculate the proportion of G or C bases in each 
window from a given reference genome. Then, we apply a 
cubic spline smoothing and transform the GC proportion 
based on the fitted curve so that the transformed GC pro- 
portion and the logarithm of the RD are linearly correlated. 
Finally, the transformed GC proportion is median centered 
and is referred to as GC content hereafter. 

Mappability score 

As a function of both reference sequence and read length 
(A'-mer), mappability score is calculated a priori in four 



steps: (i) for each base pair position in the genome, extract 
a K-mer from the reference genome, which consists K 
consecutive bases starting at this base pair position, 
(ii) Align the K-mers back to the reference genome using 
a desired aligner, e.g. BWA (37). Ideally, the aligner and 
the alignment parameters are chosen to match what was 
used for generating read alignment files from the sample 
genomes, (hi) Identify mappable base positions where the 
corresponding K-mers map back to themselves unambigu- 
ously (i.e. there is a single best hit and it is the true pos- 
ition of the K-mer). For example, the XO field produced by 
BWA (37) relates a K-mer from a specific place in the 
genome to the number of best hits of that K-mer in the 
entire genome. If a K-mer has a XO value of 1, the corres- 
ponding base can be identified as a mappable base, (iv) 
Compute mappability score as the proportion of 
mappable bases in a given window, which measures the 
uniqueness of specific regions of the reference genome. 

Window consideration 

The window size and the degree of overlap between them 
are adjusted for specific data. In this study, a window size 
of 500 bp with 200-bp overlap was chosen for all data sets 
for several reasons: first, the window size should be no less 
than the mean DNA fragment size of the sequencing 
library. Second, using a larger window size (e.g. 1 kb) or 
non-overlapping windows would decrease precision in 
defining CNV break points and miss CNVs that only par- 
tially span one window. Third, a higher degree of overlap 
introduces more inter-window correlation, which necessi- 
tates appropriate adjustment in modeling the RD signals. 

CNV detection method 

This triplet of data for each individual genome is input 
into an integrative HMM, which classifies each window to 
a copy-number state based on maximum a posteriori prob- 
ability, while simultaneously accounting for sources of 
bias. The state changes mark the predicted break points 
of CNVs. Below we present a short intuitive view of our 
method and the elements needed in HMM characteriza- 
tion; full details are given in the Supplementary Methods 
(see Additional file 1). 

Hidden states and transition probability 

While microarray analysis suffers from over-saturation at 
high copy numbers, HTS allows RD-based methods to 
determine high copy numbers with improved accuracy 
(27). The total number of hidden states is implemented 
as an input parameter of GENSENG and can be freely 
specified by users. Theoretically, the more copy-number 
states specified, the more accurate the model becomes. 
However, a number of practical issues must be considered. 
For example, specifying more states means longer 
computing time, and for some data sets, there may not 
exist sufficient regions from which to estimate parameters. 
For the HTS data sets used in this study, we assume seven 
hidden states representing copy numbers of 0, 1, 2, 3, 4, 5 
and 6 or more. For homozygous populations such as 
inbred mice, we assume four hidden states representing 
copy numbers of 0, 2, 4 and 6 or more. We collapse the 
duplications with six or more copies into one state because 
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they are difficult to distinguish because of both experimen- 
tal (reduced signal-to-noise ratio) and computational 
concerns (having few regions with very high RD signal). 

State transitions proceed from one window to the next 
according to a first-order time-homogeneous Markov 
process. The transition probability describes the probabil- 
ity of having a copy-number state change between two 
adjacent windows. Intuitively, the copy-number state is 
unlikely to change for nearby windows but is more 
likely to change for windows that are far apart. 

Emission probability 

The hidden copy-number states emit probabilistic outputs 
at each window, i.e. the observed RD signal representing 
integer-valued count data. In the absence of sources of 
bias, sequencing coverage is uniform across the genome 
such that the emission probability of RD could be 
modeled by a Poisson distribution with equal mean and 
variance. In the presence of sources of bias, sequencing 
coverage is not uniform, and the Poisson-distribution as- 
sumption fails. To account for biases, the emission prob- 
ability of RD is modeled as a mixture of uniform 
distribution and NB, expressed as the following: 



c/R m +(l - c) 
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where c is the mixing probability, o, is the RD signal for 
window t, fly is the mean RD for window t given state j, <pj 
is the over-dispersion parameter given state/ The uniform 
distribution has a density function 1/R m to model any 
random fluctuation of RD, where R m is treated as a 
known constant using the largest RD among all 
windows of the chromosome. When non-overlapping 
windows are used, the mean RD for each window, ix t j, is 
modeled by an NB regression model, where the predictors 
include copy-number state, GC content and mappability 
score. When overlapping windows are used, the observed 
RD is drawn from an autoregressive process; thus, a 
residual term is included as an additional predictor in 
the NB regression model assuming first-order 
autoregression. The additional noise in the data that 
cannot be explained by variability in GC content and 
mappability is accommodated by <j>j, the over-dispersion 
parameter of the NB distribution (allowing variance to be 
larger than mean) and the uniform distribution in the 
mixture model. 

Tuning parameters 

Given the HMM topology, the challenge lies in optimizing 
model parameters given the observed data, also known as 
HMM training. There are many parameters to be 
optimized. To reduce computational difficulty, we choose 
to specify a subset of HMM parameters based on previous 
knowledge and user preference, including the initial state 
probability, state transition probability and the mixing 
probability in emission probability. These tuning param- 
eters can be influential and should be chosen carefully. 
The remaining emission parameters, including the coeffi- 
cients and over-dispersion parameters in the NB regression 
model, are estimated for each data set. 



Parameter estimation 

The optimization problem is solved by the Baum-Welch 
algorithm (45), which maximizes the data likelihood for an 
individual chromosome in iterative steps, including initial- 
ization, expectation and maximization. In the initializa- 
tion step, we rely on intuitive guesses as well as 
empirical values. The initial emission parameters were 
estimated from the 1000GP and the Mouse Genomes 
Project data sets where known CNVs are available. 
These initial emission parameters are saved for the 
human and mouse genomes, respectively, and are used 
for any new sample without previous knowledge of its 
CNVs. In the maximization step, we obtain 
maximum-likelihood estimates of emission parameters. 
We apply a weighted NB regression model, where the 
weights are posterior probabilities for each window be- 
longing to a particular copy-number state, given the 
observed data of an entire chromosome. These weights 
represent current knowledge of the probabilistic classifica- 
tion of a window to copy-number state and are updated in 
the expectation step. While included as a predictor in the 
regression model, the copy number is the hidden variable 
to be inferred from the observed data. Intuitively, using 
posterior probability as regression weights, we are able to 
partition the observed RD across all hidden states, pro- 
portional to the likelihood. The weighted NB regression 
model is fitted by alternately estimating regression coeffi- 
cients using iteratively reweighted least squares and 
estimating the over-dispersion parameter using a 
Newton-Raphson method. In the expectation step, we 
update the forward, backward and posterior probability, 
given the current estimates from the maximization step. 
The expectation and maximization steps iterate until the 
convergence criterion (<10 -6 change in the log-likelihood) 
is reached. 

CNV calling 

Using the parameters at convergence, first we obtain the 
final estimates of the posterior probability for each 
window belonging to a particular state, given the 
observed data from the entire chromosome. Second, we 
assign the final estimate of copy number for each window 
using the state with the largest posterior probability. The 
state changes mark the predicted breakpoints of CNVs. 
The confidence score of a CNV region is computed as the 
sum of the posterior probabilities of all windows enclosed 
within the break points. Next, a two-step merging algo- 
rithm is carried out to refine the boundaries of the CNVs. 

Prioritization of CNV calls 

A CNV QC step can be applied to remove CNVs pre- 
dicted with the lowest confidence. We recommend 
removing predicted CNVs shorter than 800 bp (i.e. 
removing those that appear in only one window as 
shown in Supplementary Table S2), or predicted CNVs 
with an average mappability <0.3 (i.e. removing 
those that cannot be confidently predicted as shown in 
Figure lb). An additional prioritization approach was im- 
plemented via the RD-accessibility (RDA) statistic, which 
reflects the signal-to-noise ratio of a predicted CNV region 
after accounting for known confounders in RD. The term 
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Figure 1. Relationship between read-depth and mappability in high-confidence CNVs (1). (a) The boxplot of read-depth from windows mapped to 
the 610 high-confidence deletions (red) and 261 high-confidence duplications (blue), suggesting a similar read-depth distribution between deletions 
and duplications and no power in detecting CNVs. (b) The boxplot of read-depth stratified by mappability classes, color-coded such that darker 
shades reflect higher mappability. The labels of the x-axis indicate the CNV class (DEL: deletions; DUP: duplications) and mappability class. For 
example, label (DEL: 0.2-0.3) indicates windows from the high-confidence deletions and with mappability score ranging from 0.2 to 0.3. Within each 
mappability class, duplications show higher mean read-depth than deletions, suggesting that correction for mappability improves the ability to detect 
CNVs. Furthermore, when mappability falls below 0.3, read-depth distribution becomes increasingly similar between deletions and duplications, 
suggesting that the ability to detect CNVs in those regions is limited. For example, for windows with mappability ranging from 0.2 to 0.3, ~50% of 
windows in the duplication regions had read-depths equal to or lower than the average read-depth from compatible copy-normal regions; ~20% of 
windows in the deletion regions had read-depths equal to or higher than the average read-depth from compatible copy-normal regions. 



of RDA was first coined by Abyzov et al. (26), but for a 
different purpose. The RDA statistic is computed in three 
steps: (i) after CNV calling, identify all compatible 
copy-number-neutral windows the GC content and 
mappability scores of which are the same as those from 
the region of interest; (ii) calculate the average window 
counts from (i) as the expected RD for the region of 
interest; and (iii) obtain the RDA by dividing the 
observed RD by the expected RD for the region of 
interest. Using a copy number of two as normalization 
for copy-number-neutral autosomal regions, the theoret- 
ical signal-to-noise ratios are 0, 0.5, 1.5 and 2 for copy 
numbers of 0, 1, 3 and 4, respectively. Therefore, a 
region is considered to be RD accessible if its RDA 
value is <0.5 for homozygous deletions, <0.75 for hetero- 
zygous deletions and >1.25 for duplications. In general, 
we recommend removing CNVs predicted from regions 
that are not RD accessible (e.g. if its RDA values range 
between 0.75 and 1.25). In addition, we recommend 
ranking the predicted regions by their RDAs, where a 
higher signal-to-noise ratio reflects higher confidence 
that the predicted CNVs are correct; this is analogous to 
ranking by fold change in gene expression analysis. 

Performance assessment 

While the high-confidence data set compiled by Mills et al. 
(1) indicates where the true positive CNVs are for 



HapMap individuals, the true negatives are unknown. 
Therefore, we used two approaches to assess our 
method's performance. First, we conducted a simulation 
to estimate the sensitivity and specificity to detect CNVs. 
Then, we analyzed the high-coverage 1000GP trio data, 
where we estimated the sensitivity using high-confidence 
CNVs and used the total number of base pairs or calls as a 
surrogate measure for specificity. For comparison, we 
applied CNVnator (26) in parallel, using its recommended 
parameter setup and QC filter. The methodology differ- 
ences between GENSENG and CNVnator are detailed in 
Supplementary Table S5. The main differences are in bias 
correction and segmentation techniques. 

Simulation 

We simulated two data sets for performance assessment. 
The first simulation directly generated RD data (28). 
Using chromosome 1 from NA12878 as a template, we 
implanted 76 high-confidence CNVs (25 duplications 
and 51 deletions) (1) by assigning a copy number of four 
to any window that overlapped the duplications and a 
copy number of zero to any window that overlapped the 
deletions. All other windows were assigned to have a copy 
number of two. The covariate matrix (the assigned copy 
number, mappability score and GC content of each sliding 
window) and coefficient vector were passed to the 'garsim' 
function from R/gsarima to simulate the RD for each 
window. The 4 garsim' model we applied was the NB 
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distribution with the log link function, where the 
autoregressive parameter was set to 0.6, the zero correc- 
tion parameter was set to 'zql' and the inverse of the over- 
dispersion parameter was set to 0.01. 

The second simulation mimicked a sequencing experi- 
ment to generate paired-end reads from a CNV containing 
a hypothetical chromosome. To simulate reads, we used 
chromosome 1 of the reference human genome as a 
template and modified the template sequence based on 
the 76 high-confidence CNVs (51 deletions and 25 dupli- 
cations) (1). For any deletion, we removed the correspond- 
ing sequence of the deleted DNA, and for any duplication, 
we inserted an extra copy of the duplicated sequence. As a 
result, the implanted deletions were copy number 0 dele- 
tions, and the implanted duplications were copy number 4 
duplications. Among the 76 high-confidence CNVs, eight 
deletions and four duplications overlap with other dele- 
tions or duplications. Thus, 64 independent CNVs were 
implanted (43 deletions and 21 duplications) into the 
chromosome 1. Second, after the CNV-containing hypo- 
thetical chromosome was created, we applied the 
sequencing simulator, wgsim, as implemented in 
SAMTools (44) to generate 36-bp paired-end short 
reads. For wgsim simulation, the mean value of the 
outer distance between the two ends was set to 200, the 
standard deviation was set to 20 and the sequencing error 
model was the empirical error model of the Illumina 
sequencing platform. A total of 150 million paired-end 
reads were generated, which gave an average sequencing 
coverage of 40 x. Third, we used BWA (37) to map the 
reads to the unmodified reference human genome. The 
resulting alignment file was used as input to apply 
GENSENG and CNVnator (26). Among CNVs predicted 
by either approach, a true discovery was defined when a 
predicted CNV overlapped with at least 50% of a 
simulated CNV and had the same copy number. 

1000GP data 

We analyzed the high-coverage sequencing data for the 
CEU trio from the 1000GP. To facilitate the comparison 
between the predicted CNVs and the high-confidence 
CNVs, which only provide deletion and amplification 
calls rather than the particular copy number, we defined 
deletions as any GENSENG calls where the inferred copy 
numbers were 0 or 1, and duplications as any calls where 
the inferred copy numbers were >2. Sensitivity was 
calculated by dividing the number of total base pairs of 
the overlapping events (>l-bp overlap, or >50% recipro- 
cal overlap with the high-confidence CNVs) by the total 
number of high-confidence CNVs. 

Performance on low-coverage data 

The native coverage of both our simulated data and the 
1000GP high-coverage data is ~40x . To identify the lower 
bound that GENSENG can handle, we applied 
GENSENG to data with varying sequencing coverage 
and compared the performance with that based on the 
native coverage using the same evaluation metrics. First, 
we repeated the simulation process as described earlier, 
with the targeted coverage been set as 5x, 10 x, 20 x, 
30 x and 40 x. To test the consistency of our simulation, 



we also simulated data at 40 x coverage for 10 times and 
observed replicable results (data not shown). Second, 
using the DownsampleSam.jar tool from Picard (http:// 
picard.sourceforge.net), we down-sampled the high-cover- 
age 1000GP data from NA12878 and achieved a series of 
sequencing coverage of 5x, lOx, 20 x, 30 x and 40 x. 

RESULTS 

Evaluation of experimental biases in HTS 

Under idealized scenarios, HTS RD is expected to follow 
a Poisson distribution with variance equal to the mean. 
However, we found that the observed variance is much 
greater than the mean (Supplementary Figures Sla and 
S2a), indicating substantial deviation from the Poisson 
distribution. We found that some of the non-uniformity 
in RD is caused by genome-wide variability in mappability 
and GC content. Supplementary Figures Sib and S2b 
show a positive correlation between mappability and 
RD, where low mappability scores indicate a higher pro- 
portion of repetitive sequences, resulting in lower RD; 
high mappability scores indicate a higher proportion of 
unique sequences, resulting in higher RD. Supplementary 
Figures Sic and S2c show a non-linear relationship 
between GC content and RD, where sequences with 
extreme GC content (low or high) tend to have lower 
RD. In addition to the general trends observed across 
various data sets, we found that the curves of RD versus 
GC content varied from sample to sample. For example, 
the peak of the mouse sample slightly shifted to the right 
(Supplementary Figure S2c). The fractions of mappable 
bases in the genomes were found to be 80-90% and 
increased moderately as the read-length increased (Supple- 
mentary Table SI). Lastly, we found that the 
non-uniformity in RD could not be explained solely by 
GC content or mappability. We examined the RD distri- 
bution from compatible windows that had the same GC 
content, had the same mappability scores and were mostly 
likely copy-number normal (e.g. did not overlap any 
high-confidence CNVs or other candidate CNVs). Then 
we compared the observed distribution from these com- 
patible windows with the theoretical expectations. If GC 
content and mappability were the only sources of biases, 
the observed distribution should have closely followed a 
Poisson distribution. However, Supplementary Figure S3 
suggests that the Poisson distribution still fails because it 
restricts variance to equal the mean; in contrast, the NB 
distribution fits the data well because its over-dispersion 
parameter accommodates additional sources of noise in 
the data. 

To investigate how experimental biases impact 
copy-number inference, we examined the relationship 
between mappability and RD in high-confidence CNVs 
(610 deletions and 261 duplications) (1). Without account- 
ing for mappability, duplicated and deleted regions could 
not be distinguished because the RD distributions were 
very similar (Figure la). However, after stratification by 
mappability scores, the mean RDs became significantly 
different between duplicated and deleted regions within 
a mappability class, such that these CNVs could be 
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recovered (Figure lb). This observation suggests that it is 
important to jointly estimate copy number and the effect 
of confounding factors. Copy-number inference made 
without correcting for bias may lead to systemic errors. 
Furthermore, as shown in Figure lb, when mappability 
scores are extremely low (e.g. <0.3), too few reads can be 
confidently aligned to those regions, and consequently 
CNVs cannot be confidently predicted. This observation 
suggests that a CNV QC filter based on mappability score 
could be applied to reduce false-positive predictions. 

The GENSENG method 

To mitigate the effects of experimental bias and improve 
RD-based CNV detection, we developed a novel statistical 
method called GENSENG. The unique feature of 
GENSENG is to integrate the correction of multiple 
sources of bias and the inference of the copy-number 
states in a single analysis. Figure 2 gives an algorithmic 
overview of GENSENG. The tasks of input preparation 
were implemented in R, perl and python programming 
languages. The computational core of GENSENG was 
implemented in C++. Recommendation for the tuning 
parameters and initial emission parameters are provided 
as part of the software release. Given the input, 
GENSENG can report CNVs from an ~40x human 
chromosome within a couple of hours. 

The methodological details of GENSENG are 
described in the 'Methods' section and in the 
Supplementary Methods (see Additional file 1). Briefly, 
the key components of GENSENG are summarized 
below. First, we used an HMM with seven states (0-6) 
for modeling of copy number. In contrast, existing 
methods find only two general types of copy numbers, 
i.e. ioss/deletion' and 'gain/duplication'. The support for 
the seven-state modeling strategy is evident from examples 
shown in Figure 3, where GENSENG correctly recovered 
high-confidence CNVs and identified their status as homo- 
zygous deletion, heterozygous deletion and multi-allelic 
duplications. Some discrepancies in the boundaries 
between the predicted CNVs and the high-confidence 
CNVs were observed, reflecting technological differences 
between HTS and microarrays. 

Second, we used an NB regression model for RD and 
included known confounders as covariates, such that their 
effects on RD were removed. The emission probability of 
RD was made even more robust against RD outliers using 
a mixture model of NB and uniform distributions, such 
that any additional biases in the data could be modeled by 
the NB over-dispersion parameter and the uniform distri- 
bution. These modeling strategies permit simultaneous 
bias correction and CNV detection. Some of the benefits 
of such simultaneous analysis is illustrated in Figure 3e, 
which demonstrates good sensitivity for detecting duplica- 
tion from a noisy region with a medium mappability score 
(0.58) after accounting for mappability and additional 
noises. 

Multiple techniques were introduced in GENSENG, 
including correcting for GC content and mappability, 
modeling autoregression, fitting a mixture of NB and 
uniform distributions and applying QC to prioritize 



CNV calls (see 'Methods' section). To study the effects 
of these techniques on CNV detection and identify the 
best-fitting model, we examined the sensitivity and the 
number of CNV calls made by different partial versions 
of our method (Supplementary Table S2). We found that 
correcting for GC content alone was not sufficient; and 
further correcting for mappability resulted in the most 
substantial improvement, with gains in both sensitivity 
and specificity. QC including both size and RDA filters 
substantially improved specificity, with minimal loss in 
sensitivity. The best GENSENG model was selected 
based on these results and included all aforementioned 
techniques. 

Performance assessment and comparison 

To benchmark the performance of GENSENG, we 
carried out two sets of simulations. In the first simulation, 
we applied GENSENG to simulated RD data 
(sequence-fragment counts across tiled windows) for a 
single chromosome. For 76 simulated CNVs, we 
observed 88% sensitivity and 100% specificity. The re- 
maining 12% simulated CNVs (three duplications and 
six deletions) did not pass CNV QC filters (RDA filter 
and mappability <0.3). 

In the second simulation, we applied both GENSENG 
and CNVnator to the simulated sequence reads (instead of 
window read counts) for a single chromosome, and we 
compared the sensitivity and specificity between 
GENSENG and CNVnator (Table 1). Before any CNV 
QC filter was applied, GENSENG produced 1478 fewer 
false CNV calls (69% fewer false deletions and 56% fewer 
false duplications), demonstrating a better specificity. A 
total of 12 simulated CNVs (10 deletions and two dupli- 
cations) were not detected by GENSENG, and they were 
missed because they were smaller than the minimum size 
of CNVs detectable by GENSENG (i.e. <800bp), or had 
mappability <0.3 (i.e. unreliable regions). After the rec- 
ommended CNV QC filters were applied (RDA filter for 
GENSENG and the default qO filter for CNVnator), 
GENSENG outperforms CNVnator in both sensitivity 
and specificity. Specifically, for 43 simulated deletions, 
GENSENG had 77% sensitivity, 7% higher than 
CNVnator. For 21 simulated duplications, GENSENG 
had 90% sensitivity, 33% higher than CNVnator. The 
specificity for duplications was 100% for both 
GENSENG and CNVnator. Both methods made false 
deletion discoveries, but GENSENG had better specificity 
(1328 fewer false deletions, 12% lower false discovery rate 
(FDR) than CNVnator). Increasing the stringency of the 
CNV QC filter by removing CNVs with mappability <0.3 
further improved GENSENG'S specificity (9% FDR for 
deletions, or 43% lower FDR than CNVnator based on 
the same stringent CNV QC filter), while maintaining its 
good sensitivity (67% for both deletions and duplications, 
or 15% higher sensitivity than CNVnator). This result and 
the result from Figure lb suggest the usefulness of the 
mapability filter. In Supplementary Figure S4, we show 
that ranking the RDA statistic (i.e. signal-to-noise ratio 
after accounting for confounders) computed for each 
CNV is an effective approach to correctly prioritize the 
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Figure 2. Overview of GENSENG inference framework. The required input contains two parts: the triplet data (read-depth, GC content and 
mappability score) and the initial parameter values. The input is passed to the GENSENG engine for parameter training based on the Baum- 
Welsh algorithm. To update the emission probability and the parameters for the negative binomial regression model, the weighted generalized linear 
model (GLM) fitting algorithm is applied iteratively, which uses the updated posterior probability of the copy-number state as the regression weights 
in each iteration. At the convergence of parameter training, GENSENG identifies the state with the largest posterior probability and assigns the 
associated copy number to the corresponding window. Finally, GENSENG outputs the coordinates of CNV segments and the confidence scores. 



prediction made by GENSENG or CNVnator. Most re- 
maining false-positive deletions in the CNVnator data set 
were likely influenced by multiple sources of bias, and the 
52% FDR is a reasonable estimate for CNVnator based 
on the experimental validation conducted by the 1000GP 
(reported to range from 14.3 to 74.1%) (1,46). In 
summary, the simulation studies demonstrate that 
GENSENG outperforms CNVnator, suggesting that 
integrating bias correction from multiple sources and 
copy-number inference is a desired strategy for RD- 
based CNV detection. 

To further evaluate GENSENG's performance, we 
analyzed the 1000GP data. First, we applied both 
GENSENG and CNVnator to the high-coverage HTS 
data for NA12878 and focused on calling autosomal 
CNVs (Table 2). Sensitivity was estimated by comparing 
the predicted CNVs to the high-confidence CNVs from 
Mills et al. (1) (610 deletions and 261 duplications). 
GENSENG gave an overall sensitivity of 56% (73% for 
deletions and 17% for duplications) using the 50% recip- 
rocal overlap criterion. In contrast, CNVnator gave a 
lower overall sensitivity of 50% (64% for deletions and 
16% for duplications) using the same criterion. 
Approximately 87% of the high-confidence CNVs were 
obtained from high-density microarrays based on 
changes in probe intensities (2,38,39). For these CNV 
regions, the evidence for changes in RD may or may not 
be observed from HTS data (26) (also see 'Discussion' 
section), which could be predicted by the RDA statistic. 
Similarly to Abyzov et al. (26), we found that -76% (462 
of 610) high-confidence deletions were RD accessible (i.e. 
RDA < 0.75) from the HTS data, whereas only ~21% 
high-confidence duplications (53 of 261) were RD 



accessible (i.e. RDA > 1.25). Given these observations, 
we then recomputed sensitivity by comparing the pre- 
dicted CNVs with the high-confidence CNVs that are 
RD accessible (462 deletions and 53 duplications), which 
yielded an overall sensitivity of 90% for GENSENG and 
an overall sensitivity for 79% for CNVnator [the 79% 
sensitivity is similar to that reported by the authors of 
CNVnator (26)]. The high-confidence CNV set (1) does 
not provide information on the true negatives needed to 
assess specificity; thus, we focused on calibrating sensitiv- 
ity as described above and used the volume (i.e. the total 
number and total base pairs) of the predicted CNVs as a 
surrogate measure of specificity. We found that the pre- 
dicted volumes are comparable between the two methods. 

Then, we applied GENSENG and CNVnator to the 
1000GP HTS data from the CEU trio (NA12878, 
NA12891, NA12892), and we evaluated the sensitivity 
for detecting deletions as compared with the deletions 
from Handsaker et al. (25), which represent the 
combined CNV calls from 1000GP (1). We found that 
an average of 49% deletion calls from Handsaker et al. 
(25) intersected with GENSENG calls (Supplementary 
Table S3). In contrast, we found an average of~37% 
intersected with CNVnator calls. The deletions reported 
in Handsaker et al. (25) were derived from the results from 
the 19 algorithms used by the 1000GP (1) and contained 
many deletions that are smaller than the minimum size of 
CNVs (<800bp) detectable by GENSENG. For 
NA12878, 89% of deletions (1026 of 1148) from 
Handsaker et al. (25) missed by GENSENG were due to 
the size. Similarly, for NA12891, 51% of deletions (547 of 
1072) were missed owing to the size, and for NA 12892, 
51% of deletions (523 of 1030) were missed owing to the 
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Figure 3. Example high-confidence CNVs predicted by GENSENG from the NA 12878 HTS data. Each subfigure (a-e) has four panels from top to 
bottom, and the x-axis of each subfigure indicates genomic position in base pairs. In the first panel, the black dots on the j'-axis indicate read-depth 
signal; red dashed lines are boundaries from GENSENG prediction; green solid lines are boundaries reported in the high-confidence CNV set (1); 
and grey lines are the median read-depth of the chromosome. The GC content and mappability of the region are plotted in the second and the third 
panels respectively. The fourth panel shows the locations of segmental duplication (purple, from the UCSC hgl9 segmental duplication track) and 
repetitive DNAs (orange, from the UCSC hgl9 repeatmask track). Shown here are a homozygous deletion (a), a heterozygous deletion (b), a simple 
and large duplication (c) and a complex duplication (d) that was predicted to be copy number 6+ and was right flanked by a large region with a 
median mappability of 0.2. Finally, (e) shows a duplication predicted to be copy number 4 from a noisy region with a median mappability of 0.58, 
illustrating good sensitivity for detecting duplications using simultaneous bias correction and copy-number inference. 
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size. In summary, the analyses of the 1000GP data confirm 
that GENSENG have better detection sensitivity than 
CNVnator and suggest similar or better specificity. 

Finally, we demonstrate that, as expected, higher 
sequencing coverage improves CNV detection power, 
and that the lower bound of sequencing coverage that 
yields reasonably good performance of GENSENG is 
10 x (Supplementary Tables S6 and S7). GENSENG 
could potentially work on data sequenced to as low as 
5x but with much reduced sensitivity (decreased by 33% 
for deletions and decreased by 10% for duplications, 
Supplementary Tables S6 and S7). 

Application to other HTS data 

As a proof of concept, we applied GENSENG to 
whole-genome HTS data from human and mouse 
samples and evaluated the validity of its prediction using 
an allele-sharing principle as well as additional genetic 
information available in our data. By allele sharing, we 
mean the following. From a sequencing study, genetic mu- 
tations can be readily detected with a broad spectrum of 
allele frequency, ranging from singleton variants that are 
unique to individual genomes to variants observed in 
multiple genomes. A variant shared among multiple 
genomes could arise from the inheritance of the same an- 
cestral allele, i.e. identity by descent (IBD), such that 
shared variants could receive higher detection confidence. 
The idea of searching for shared variation to increase the 
power of CNV detection was previously explored by 
Handsaker et al. (25) in the 1000GP samples from 
low-coverage population-scale sequencing. In our study, 
we first predicted CNVs from individual genomes and 
then identified shared CNVs that could arise from IBD 
as an evaluation of GENSENG's performance. 

The three human individuals affected by bipolar 
disorder that we examined were cousins, and therefore, 
they were expected to share ~1.5% of their genomes 
(1.5% IBD). If a genomic region is IBD, we expect to 
see a similar RD pattern for that region in each individual 
genome and in the pooled reads from all individuals. If the 
IBD region contains a true CNV, this CNV could be 
detected based on higher or lower than expected RD 
using either individual alignment files or the pooled align- 
ment file. Known confounders such as genomic GC 
content and mappability could also create similarity in 
RD pattern across different genomes and consequently 
predict CNVs that are shared among them. However, 
because GENSENG accounts for these confounding 
factors while inferring copy-number states, shared CNVs 
that arise from such artifacts have been minimized. In 
summary, GENSENG identified 831 candidate CNVs 
that are shared among the three cousins. Shared CNVs, 
especially those unique to this pedigree, could indicate an 
enrichment of high-risk disease alleles, and these CNVs are 
reported elsewhere. To illustrate the utility of GENSENG, 
Supplementary Figure S5a shows an example of shared 
deletion, also included in the 1000GP SV release (1,25). 
This suggests that alleles segregated in the general popula- 
tion at an appreciable frequency (e.g. >1% in 1000GP 
samples) would generally be shared among multiple 



individuals sequenced (25). In contrast, Supplementary 
Figure S5b shows an example of a singleton duplication 
that warrants further experimental validation. 

Similarly, we examined shared CNVs in the mouse 
genome. A genome-wide haplotype and IBD map has 
been established in 100 classical mouse strains using 
high-density single-nucleotide polymorphism(SNP) geno- 
types (47). Strains belonging to the same haplotype in a 
genomic region had >99% sequence identity and were 
considered IBD over that interval (47). Supplementary 
Figure S6 shows two examples of shared CNVs that 
stem from IBD. In addition, for the mouse strains, we 
compared the deletions and duplications predicted by 
GENSENG with those predicted by the Mouse Genomes 
Project (3,30). We found that the overall concordance rates 
ranged from 3 to 46% (Supplementary Table S4). A 
similar range of concordance was observed by the 
1000GP by comparing CNV call sets generated by 19 al- 
gorithms. Furthermore, compared with the algorithms 
used by the Mouse Genomes Project (3,30), we found 
that GENSENG had higher sensitivity for detecting dupli- 
cations and comparable sensitivity for deletions 
(Supplementary Table S4). We note that the improved sen- 
sitivity could be credited to GENSENG's bias-correction 
ability, which was absent in the approaches used by the 
Mouse Genomes Project (3,30); this improved sensitiv- 
ity warrants further experimental validation of the 
GENSENG-predicted duplications for the mouse strains. 



DISCUSSION 

We have developed a novel method, GENSENG, for de- 
tecting copy-number gain and loss from HTS data. One 
unique feature and a key advantage of our method is the 
ability to simultaneously correct for multiple sources of 
bias and infer CNVs from RD. The concept of simultan- 
eous bias correction and CNV inference can serve as a 
basis for combining RD with read-pair or split-read in a 
single analysis. 

The GENSENG method can be applied to whole- 
genome sequencing data using either single-end or paired- 
end reads or a mixture of the two. It does not require 
matched control genomes, and it does not rely on 
evidence from multiple individuals. The smallest CNVs 
that can be detected by GENSENG are 800 bp, and 
discrete copy numbers (0, 1, 2, 3, 4, 5 and 6+) are 
reported. Based on extensive benchmarking, GENSENG 
provides a better sensitivity-specificity profile than the pre- 
viously best-performing RD-based algorithm, CNVnator 
(1,26), when applied to high-coverage HTS data. We 
have also demonstrated that our method works on both 
human and mouse samples with lower coverage (15 x). 

In the current implementation of GENSENG, the top 
priority has been efficient and accurate detection of simple 
CNVs. We used reads with unambiguous mapping to 
compute RD signal, which reduced the method's sensitiv- 
ity to detect complex CNVs within repetitive regions. A 
number of specialized algorithms have been developed to 
reconstruct CNVs in repeat-rich regions by considering all 
alignment positions (31,48-51). We used a window size of 
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500-bp with 200-bp overlap to compute RD, which limits 
the break point resolution. Currently, we are developing a 
refinement pipeline to locally assemble the reads at the 
predicted break points to define break points at base 
pair resolution. This feature will be available in the 
future release of our software. 

Our likelihood-based method can be readily extended to 
incorporate all the sequence reads and the mapping uncer- 
tainty. In addition, we can incorporate other types of in- 
formation, such as haplotype, read-pair, split-reads and 
allele-specific RD, that can infer allele-specific copy 
number. Allele-specific RD can be informative for CNV 
calling. For example, in one window, if we observe ~50 
reads from paternal allele and 100 reads from maternal 
allele, a reasonable guess is that the ratio of the number 
of maternal allele versus paternal allele is 1:2, which will 
favor copy number 3 (one paternal + two maternal), or 
copy number 6 (two paternal + four maternal) etc. The 
allele-specific RD can be incorporated as part of the 
emission probability, e.g. using a beta-binomial distribu- 
tion similar to the setup of the B-allele frequency following 
the genoCN method (52). 

Not all the CNVs can be detected by RD data. On 
examining the high-confidence CNVs, we found that 
~76% of high-confidence deletions and only ~21% of 
high-confidence duplications were RD accessible from 
the 1000GP HTS data using 36-bp and 51-bp reads. The 
percentage of RD-accessible regions may increase for 
longer reads and when we incorporate reads that are 
mapped to multiple locations in the genome. In contrast, 
the high-confidence CNVs may be inaccurate. For 
example, undetected CNVs in a reference individual can 
lead to mistaken copy-number calls in the study samples 

(26.28.53) . Overall, we recommend RDA ranking as an 
effective way for prioritizing the CNVs predicted by 
GENSENG because it reflects the strength of the RD 
signal after accounting for confounders. 

Duplications are generally more challenging to detect 
than deletions by RD-based methods for several reasons 

(20.26.54) . First, the RD distribution (Poisson or NB) 
suggests that the higher the RD signal, the larger the 
signal variance. As expected, RD-based methods suffer 
reduced sensitivity in the detection of duplications 
(higher variance) compared with deletions (lower 
variance). Second, as mentioned in the proceeding para- 
graph, the proportion of RD-accessible high-confidence 
duplications is much less than that for deletions (21% 
versus 76%), thus reducing the sensitivity (Table 2). 
Lastly, as noted by Abyzov et al. (2011) (26), abnormally 
high RD signal may not necessarily represent a true du- 
plication but rather the effect of an 'unknown reference'. 
QC procedures that are aimed to reduce such false-posi- 
tive duplications (e.g. removing windows that are RD 
outlier or have any overlap with known genomic gaps) 
would lead to reduced sensitivity for duplications overall. 
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